ACCY 570 Group Project

Overview


Goal

Complete a data analytics project that demonstrates your mastery of the course content.

  1. Demonstrate the ability to use Markdown and basic Python to summarize data and to produce insightful visualizations.
  2. Show that you can use the proper statistical methods to summarize your data.
  3. Show that you can create useful plots to highlight the important points of your data.

Prompt

After several years of operation, the Lending Club™ wants to understand how the individual components (i.e., features) of a loan, such as the amount loaned, the term length, or the interest rate, affect the profitability of making a specific loan. In particular, they're interested in understanding how likely a loan is to be repaid and what type of return they can expect. In this project, you will use the provided loans dataset to help the Lending Club™ figure out which loans are most profitable.


Criteria


You will work in groups of 4-5 students to analyze these data to make recommendations based on the variables in the loan.csv dataset. Specifically, you should address the following questions:

  1. What is the probability of a full repayment, and
  2. What is the return on the loan?

To simplify the analysis, your group must focus on at least five features (or variables) that you think might be important (you may include more if you wish, but you must use at least five).

You will complete three tasks for this group project:

  1. A group report in the form of a Jupyter notebook,
  2. An in-class presentation where your group will present your results, and
  3. Peer evaluation of the contributions of each member of your group.

Your final group report will be a single Jupyter notebook that integrates Markdown, Python code, and the results from your code, such as data visualizations. Markdown cells should be used to explain any decisions you make regarding the data, to discuss any plots or visualizations generated in your notebook, and to present the results of your analysis. As a general guideline, the content should be written so that a fellow classmate (or an arbitrary data scientist/analyst) can read your report and understand the results, implications, and processes that you followed to achieve your result. If printed (not that you should do this), your report should be at least fifteen pages.

Your group will present the material in class in a format that is left up to each group. For example, you can use presentation software such as MS PowerPoint, PDFs, your Notebook, or Prezi, or, alternatively, you can choose some other presentation style (feel free to discuss your ideas with the course staff). The presentation should cover all steps in your analytics process and highlight your results. It should take between eight and twelve minutes and will be graded by your discussion teaching assistant.

Rubric

  • Notebook Report (60%)
  • Class presentation (30%)
  • Peer assessment from your group-mates (10%)

General

Your report should

  1. use proper markdown.
  2. include all of the code used for your analysis.
  3. include properly labeled plots (e.g., use axis labels and titles).
  4. use a consistent style between graphs.
  5. be entirely the work of your own group. Do not plagiarize code, including anything you might find online; all code should be written by you and your group.

Exploratory Data Analysis (EDA)

  1. For each of the features you select, briefly explain why your group decided to include it in the analysis. Specifically, consider why each feature might be helpful in predicting which loans are likely to be repaid and which loans are likely to have the highest return. Other points to keep in mind:
    • At least one variable must be categorical and at least one must be numerical.
    • You do not need to include summaries of any feature not included in your analysis.
    • Hint: features with no (or very few) missing values are likely to be the most useful in making predictions.
  2. Create histograms to visualize each of the features included in your analysis.
    • Briefly describe any issues or irregularities that you see in the data, for example:
      • Are there any major outliers?
      • Does the data look well dispersed or does it clump around some point?
      • Are there a lot of missing values?
  3. Compute appropriate descriptive statistics for each feature.
    • Numerical features should include at least a measure of centrality and a measure of dispersion.
    • Categorical features should at a minimum indicate which category is most popular.
  4. Create histograms for both the Repaid and Return features. Do they look normally distributed?
    • Fit a normal distribution to the histogram.
    • Include and discuss a QQ plot.
  5. Finally, include plots that demonstrate the relationship between each explanatory feature and each target feature.
    • You might find a box plot, violin plot, scatter plot, or heat map to be helpful.
    • Analyze these plots and note anything interesting or unusual that you see in your plots.
    • Do any features look to be strongly related to either target?
    • Comment on why you think features might be correlated.
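The descriptive-statistics requirement above can be sketched in a few lines of pandas; the column names and values below are hypothetical stand-ins, not drawn from loan.csv:

```python
import pandas as pd

# Hypothetical miniature standing in for loan.csv (names/values are made up).
df = pd.DataFrame({
    "loan_amnt": [5000, 12000, 8000, 20000, 12000],
    "home_ownership": ["RENT", "MORTGAGE", "RENT", "OWN", "RENT"],
})

# Numerical feature: one measure of centrality and one of dispersion.
center = df["loan_amnt"].median()
spread = df["loan_amnt"].std()

# Categorical feature: the most popular category and its frequency.
top = df["home_ownership"].mode()[0]
freq = df["home_ownership"].value_counts().iloc[0]

print(center, round(spread, 2), top, freq)
```

`df.describe()` (and `describe(include="object")` for categorical columns) reports the same quantities in a single call.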

Machine Learning

  1. Preprocess all data appropriately.

    • Normalize any continuous feature.
    • Encode any categorical feature.
    • Randomly split the data into Training and Testing data sets.
  2. Build a classifier on Training dataset to classify loans as either repaid or not.

    • Try at least two different classification algorithms and compare their performance on the Testing dataset using some measure of predictive power.
    • Include a confusion matrix for each algorithm.
    • Create a ROC curve comparing the performance of each algorithm on Testing dataset and compute and display the AUC for each algorithm.
    • Create at least one other visualization that compares their relative performance on Testing data.
    • Explain which model you prefer, including justification beyond the value of a performance metric.
  3. Build a regression model on Training dataset to predict the loan return.

    • Try at least two different regression algorithms and compare their performance on Testing dataset using some measure of predictive power.
    • Create a lift chart comparing the performance of each model.
    • Explain which model you prefer, including justification beyond the value of a performance metric.
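The preprocessing steps in item 1 above can be sketched with scikit-learn. This is a minimal illustration on a made-up frame (the column names such as loan_amnt and term are assumptions, not the graded solution):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up stand-in for loan.csv.
df = pd.DataFrame({
    "loan_amnt": [5000, 12000, 8000, 20000, 15000, 7000],
    "int_rate": [10.5, 13.2, 9.8, 17.4, 14.1, 11.0],
    "term": ["36 months", "60 months", "36 months",
             "60 months", "36 months", "36 months"],
    "repaid": [True, False, True, False, True, True],
})
X, y = df.drop(columns="repaid"), df["repaid"]

# Normalize the continuous features; one-hot encode the categorical one.
prep = ColumnTransformer([
    ("num", StandardScaler(), ["loan_amnt", "int_rate"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["term"]),
])

# Random train/test split; fit the transformer on the training data only,
# so no information from the test set leaks into the scaling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=23)
Xt_train = prep.fit_transform(X_train)
Xt_test = prep.transform(X_test)
print(Xt_train.shape, Xt_test.shape)
```

Fitting the scaler and encoder on the training split only, then reusing them on the test split, mirrors how the models will later be evaluated.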

Conclusion

Summarize the results of your analysis. This summary should include anything interesting you found when performing EDA. Also, discuss the results of each classification and regression model. Be sure to address whether your classifier was much better than random on the test data, and comment on how accurate the predictions were from your regression model. Next, be sure to discuss the importance of each feature in both machine learning tasks. Finally, comment on your results and how they might be used to improve the performance of future loans made by the Lending Club™.


Getting Started

In order to ensure everyone starts at the same point, we provide the following Code cell that creates the two target columns that you will use for the analyses required to complete this group project. The return feature encodes the return for each loan and the repaid feature encodes whether the loan was repaid in full or not (for simplicity, we assume that the loan is repaid if the borrower pays more than the loan amount).

You should include these code cells in your own group project notebook.


In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
import warnings
warnings.filterwarnings("ignore")

loan = pd.read_csv('loan.csv', low_memory=False)
loan['return'] = (loan['total_pymnt_inv'] - loan['funded_amnt_inv']) / loan['funded_amnt_inv']
loan['repaid'] = loan['total_pymnt_inv'] > loan['funded_amnt_inv']

EDA


1. Data Processing and Feature Selection

1. Drop useless information

First, we encoded the categorical features so that they can be used with the SelectKBest model. We then defined a function to preview the number and percentage of missing values in each column. Because too many missing values would distort our results, we dropped every column with over 5% missing values ('emp_title', 'desc', 'mths_since_last_delinq', 'mths_since_last_record', 'next_pymnt_d', 'mths_since_last_major_derog', 'tot_coll_amt', 'tot_cur_bal', 'total_rev_hi_lim', 'Year', 'policy_code', 'id', 'member_id'). We also dropped the remaining rows containing missing values and the columns with low variance, since they contain little information. As a result, the new DataFrame holds only useful information.
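The 5% missing-value rule described above can also be expressed programmatically rather than by listing columns by hand; a small sketch on a made-up frame (the column names are illustrative only):

```python
import numpy as np
import pandas as pd

# Made-up frame: one clean column and two with missing values.
df = pd.DataFrame({
    "loan_amnt": [5000, 12000, 8000, 20000],
    "desc": [None, None, None, "paid off"],       # 75% missing
    "annual_inc": [55000, np.nan, 72000, 48000],  # 25% missing
})

pct_missing = df.isnull().mean()               # fraction missing per column
keep = pct_missing[pct_missing <= 0.05].index  # columns within the 5% rule
cleaned = df[keep]
print(list(cleaned.columns))
```

Computing the threshold this way keeps the notebook correct even if the input file changes.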

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
In [3]:
# create a new Dataframe by copying the original one so that subsequent processing will not influence each other
newloan = loan.copy(deep=True)

# encode each categorical feature according to its own categories
newloan['grade'] = pd.Categorical(newloan.grade, categories=['G', 'F', 'E', 'D', 'C', 'B', 'A']).codes
newloan['term'] = pd.Categorical(newloan.term, categories=['60 months', '36 months']).codes
newloan['home_ownership'] = pd.Categorical(newloan.home_ownership, categories=['NONE', 'OTHER', 'RENT', 'MORTGAGE', 'OWN']).codes
newloan['verification_status'] = pd.Categorical(newloan.verification_status, categories=['Not Verified', 'Source Verified', 'Verified']).codes
newloan['addr_state'] = newloan.addr_state.astype('category').cat.codes
newloan['purpose'] = newloan.purpose.astype('category').cat.codes
newloan['initial_list_status'] = newloan['initial_list_status'].astype('category').cat.codes
newloan['repaid_label'] = newloan.repaid.astype(int)

# drop columns that consist entirely of NaN
newloan = newloan.dropna(axis=1, how='all')
In [4]:
# define a function to see the missing values
def missing_values_table(df): 
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum()/len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
    columns = {0 : 'Missing Values', 1 : '% of Total Values'})
    return mis_val_table_ren_columns 

missing_values_table(newloan)
Out[4]:
Missing Values % of Total Values
id 0 0.000000
member_id 0 0.000000
loan_amnt 0 0.000000
funded_amnt 0 0.000000
funded_amnt_inv 0 0.000000
term 0 0.000000
int_rate 0 0.000000
installment 0 0.000000
grade 0 0.000000
sub_grade 0 0.000000
emp_title 5796 6.043670
emp_length 0 0.000000
home_ownership 0 0.000000
annual_inc 4 0.004171
verification_status 0 0.000000
issue_d 0 0.000000
pymnt_plan 0 0.000000
url 0 0.000000
desc 34142 35.600926
purpose 0 0.000000
title 14 0.014598
zip_code 0 0.000000
addr_state 0 0.000000
dti 0 0.000000
delinq_2yrs 29 0.030239
earliest_cr_line 29 0.030239
inq_last_6mths 29 0.030239
mths_since_last_delinq 58327 60.819378
mths_since_last_record 90777 94.656003
open_acc 29 0.030239
pub_rec 29 0.030239
revol_bal 0 0.000000
revol_util 137 0.142854
total_acc 29 0.030239
initial_list_status 0 0.000000
out_prncp 0 0.000000
out_prncp_inv 0 0.000000
total_pymnt 0 0.000000
total_pymnt_inv 0 0.000000
total_rec_prncp 0 0.000000
total_rec_int 0 0.000000
total_rec_late_fee 0 0.000000
recoveries 0 0.000000
collection_recovery_fee 0 0.000000
last_pymnt_d 143 0.149111
last_pymnt_amnt 0 0.000000
next_pymnt_d 87288 91.017914
last_credit_pull_d 4 0.004171
collections_12_mths_ex_med 145 0.151196
mths_since_last_major_derog 90519 94.386978
policy_code 0 0.000000
application_type 0 0.000000
acc_now_delinq 29 0.030239
tot_coll_amt 70276 73.278972
tot_cur_bal 70276 73.278972
total_rev_hi_lim 70276 73.278972
Year 0 0.000000
return 233 0.242956
repaid 0 0.000000
repaid_label 0 0.000000
In [5]:
# drop columns that contain over 5% missing value
newloan.drop(['emp_title','desc','mths_since_last_delinq','mths_since_last_record','next_pymnt_d','mths_since_last_major_derog',\
'tot_coll_amt','tot_cur_bal','total_rev_hi_lim','Year','policy_code', 'id', 'member_id','collections_12_mths_ex_med' ], axis=1 , inplace=True,errors= 'ignore')

# select numerical columns
cols = newloan.columns
num_cols = newloan._get_numeric_data().columns
num_loan = newloan[num_cols].copy()

# drop columns with low variance (assign the result back so the drop takes effect)
newloan = newloan.drop(newloan.std()[newloan.std() < 0.2].index.values, axis=1)

# drop rows containing missing values, then keep only non-object columns
numloan = num_loan.dropna()
featuresloan = numloan.select_dtypes(exclude='object')

2. Select Features

First, we used the SelectKBest model to compute a relevance score for each feature and kept the 20 features with scores above 100. In the following sections, we draw histograms, compute descriptive statistics, and show plots for these 20 features, then select 9 of them for the machine learning part based on the characteristics those plots reveal.

In [6]:
# SelectKBest scores each feature against the repaid label
features = featuresloan.iloc[:, 2:-2]
labels = featuresloan.iloc[:, -1]

skb = SelectKBest(k='all')

fs=skb.fit(features, labels)

for var, name in sorted(zip(fs.scores_, featuresloan.columns), key=lambda x: x[0], reverse=True):
   print(f'{name:>18} score = {var:5.3f}')
   last_pymnt_amnt score = 242702.554
       total_pymnt score = 12795.172
     total_rec_int score = 12425.808
         out_prncp score = 5696.890
     out_prncp_inv score = 5505.795
        recoveries score = 5064.032
       funded_amnt score = 3495.254
   funded_amnt_inv score = 3087.015
          int_rate score = 2631.516
total_rec_late_fee score = 2016.180
         total_acc score = 1424.465
initial_list_status score = 1424.410
   total_rec_prncp score = 925.118
    home_ownership score = 679.158
         loan_amnt score = 542.298
               dti score = 522.294
           purpose score = 420.445
           pub_rec score = 402.746
              term score = 224.587
             grade score = 105.529
        annual_inc score = 79.068
   total_pymnt_inv score = 60.354
    inq_last_6mths score = 57.293
       delinq_2yrs score = 29.735
        revol_util score = 29.678
          open_acc score = 27.221
verification_status score = 16.756
        addr_state score = 9.666
       installment score = 1.104
         revol_bal score = 0.070
collection_recovery_fee score = 0.000

2. Histograms

Histograms for categorical features

These histograms show the distribution of each selected categorical feature, making it easy to see how many categories each feature has and which are the most and least common. Grade, for example, has 7 categories; B is the most common and G the least. For home ownership, rent and mortgage are nearly tied for first place, followed by own, while other and none are the least common. For features with just 2 categories, such as term and initial_list_status, we can compare the two directly; for instance, 36 months is more popular than 60 months, probably because of interest rates or credit considerations.

In [7]:
f, ax = plt.subplots(figsize=(10, 8))
# adjust the order of grade for better view
ax=sns.countplot('grade',data=loan, order=["A", "B","C","D","E","F","G"])
ax.set_xlabel('Grade', fontsize=15)
ax.set_ylabel('Frequency', fontsize=15)
plt.show()

# use for loop to draw most categorical features' histograms
x=['initial_list_status', 'home_ownership', 'pub_rec', 'term']
for i in range(len(x)):
    f, ax = plt.subplots(figsize=(10, 8))
    ax=sns.countplot(x[i],data=loan)
    ax.set_xlabel(x[i], fontsize=15)
    ax.set_ylabel('Frequency', fontsize=15)
plt.show()

# draw separately at a larger size for a better view, because 'purpose' has many categories
f, ax = plt.subplots(figsize=(25, 8))
ax=sns.countplot(x="purpose",data=loan)
ax.set_xlabel('Purpose', fontsize=15)
ax.set_ylabel('Frequency', fontsize=15)
plt.show()

Histograms for numerical features

From these histograms we can see that total payment, interest received to date, funded amount, funded amount by investors, interest rate, total number of credit lines, loan amount, and dti look well dispersed, while last payment amount, remaining outstanding principal (both overall and by investors), recoveries, and late fees received to date clump around a few points. It is worth mentioning that the plots of dti (the ratio) and total number of credit lines look close to normal distributions. As for missing values, rows with missing values were removed before plotting, so all the histograms shown exclude them; the missing-value table above gives the counts for each feature.

In [8]:
y=['last_pymnt_amnt', 'total_pymnt', 'total_rec_int', 'out_prncp', 'out_prncp_inv', 'recoveries', 'funded_amnt', 'funded_amnt_inv', 'int_rate', 'total_rec_late_fee', 'total_acc', 'loan_amnt','dti']

# use for loop to draw all the numerical features' histograms
for i in range(len(y)):
    ax = sns.distplot(featuresloan[y[i]], kde=True)
    ax.set_xlabel(y[i], fontsize=14)
    ax.set_ylabel("Frequency", fontsize=14)
    plt.show()

3. Appropriate descriptive statistics for each feature

The .describe() function summarizes each categorical feature in words, showing its top category and frequency. For example, the most popular home ownership category is rent, and the most popular purpose is debt consolidation. For term, the more common value is 36 months, probably because people prefer short-term loans at a lower interest rate. The most popular initial list status is fractional (f), far more common than whole. So, besides the histograms, this is another good and straightforward way to analyze the features and data.

For numerical features, the descriptive statistics make each feature easier to analyze and understand. For example, the maximum last_pymnt_amnt is 36,115, far above the 75th percentile (3,806), reflecting the high standard deviation. For loan_amnt, most amounts fall between roughly 6,000 and 17,000 (the interquartile range), with a standard deviation of 7,883. Interest rate has a smaller range, from a minimum of 5.42% to a maximum of 24.89%, though that is still a large spread for a borrower. Interestingly, out_prncp and out_prncp_inv behave similarly and have very wide ranges, meaning some borrowers repaid everything while others could not afford their debt and failed to repay. Descriptive statistics can be used to analyze many other features in the same way.

In [9]:
x=['grade', 'initial_list_status', 'home_ownership', 'term', 'purpose']

# use for loop to describe all the categorical features
for i in range(len(x)):
    a = loan[x[i]].describe()
    print(a)
count     95902
unique        7
top           B
freq      30896
Name: grade, dtype: object
count     95902
unique        2
top           f
freq      92000
Name: initial_list_status, dtype: object
count     95902
unique        5
top        RENT
freq      44310
Name: home_ownership, dtype: object
count         95902
unique            2
top       36 months
freq          75004
Name: term, dtype: object
count                  95902
unique                    14
top       debt_consolidation
freq                   50587
Name: purpose, dtype: object
In [10]:
y=['last_pymnt_amnt', 'total_pymnt', 'total_rec_int', 'out_prncp', 'out_prncp_inv', 'recoveries', 'funded_amnt', 'funded_amnt_inv', 'int_rate', 'total_rec_late_fee', 'total_acc', 'loan_amnt','dti']

# use for loop to describe all the numerical features
for i in range(len(y)):
    b = featuresloan[y[i]].describe()
    print(b)
count    95541.000000
mean      3077.055864
std       5031.414209
min          0.000000
25%        277.760000
50%        628.150000
75%       3806.000000
max      36115.200000
Name: last_pymnt_amnt, dtype: float64
count    95541.000000
mean     13357.453332
std       9419.343020
min          0.000000
25%       6281.480000
50%      11284.340000
75%      18094.710000
max      57777.579870
Name: total_pymnt, dtype: float64
count    95541.000000
mean      2640.679365
std       2960.076699
min          0.000000
25%        805.270000
50%       1642.200000
75%       3223.310000
max      24205.620000
Name: total_rec_int, dtype: float64
count    95541.000000
mean       395.366968
std       1899.946333
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max      20781.260000
Name: out_prncp, dtype: float64
count    95541.000000
mean       394.487495
std       1896.829436
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max      20781.260000
Name: out_prncp_inv, dtype: float64
count    95541.000000
mean       120.002251
std        692.167254
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max      33520.270000
Name: recoveries, dtype: float64
count    95541.000000
mean     12310.872034
std       7789.320810
min        500.000000
25%       6000.000000
50%      10000.000000
75%      16500.000000
max      35000.000000
Name: funded_amnt, dtype: float64
count    95541.000000
mean     12008.987362
std       7823.070695
min          0.000121
25%       6000.000000
50%      10000.000000
75%      16000.000000
max      35000.000000
Name: funded_amnt_inv, dtype: float64
count    95541.000000
mean        12.984734
std          4.157473
min          5.420000
25%          9.990000
50%         13.060000
75%         15.700000
max         24.890000
Name: int_rate, dtype: float64
count    95541.000000
mean         1.138962
std          6.795951
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max        286.747566
Name: total_rec_late_fee, dtype: float64
count    95541.000000
mean        22.966402
std         11.248072
min          1.000000
25%         15.000000
50%         21.000000
75%         30.000000
max         99.000000
Name: total_acc, dtype: float64
count    95541.000000
mean     12433.654138
std       7883.392208
min        500.000000
25%       6000.000000
50%      10000.000000
75%      17000.000000
max      35000.000000
Name: loan_amnt, dtype: float64
count    95541.000000
mean        15.212987
std          7.397553
min          0.000000
25%          9.670000
50%         15.100000
75%         20.550000
max         34.990000
Name: dti, dtype: float64


4. Fit a normal distribution to the histograms of the Repaid and Return features

In [11]:
fig, ax = plt.subplots(figsize=(15,10))
ax.hist(featuresloan['return'], density=True, bins=50)
x = plt.xticks()[0]  
# get minimum and maximum value of x
xmin, xmax = min(x), max(x) 
# evenly spaced numbers over an interval
lnspc = np.linspace(xmin, xmax, len(featuresloan['return']))
# get mean and standard deviation
m, s = stats.norm.fit(featuresloan['return'])   
# now get theoretical values in our interval 
pdf = stats.norm.pdf(lnspc, m, s)  
# plot the histogram and fit the normal distribution
plt.plot(lnspc, pdf, label="Norm")
ax.set_xlabel("return(%)", fontsize=18)
ax.set_ylabel("Frequency", fontsize=18)
ax.legend(loc=1, fontsize = 'x-large')
plt.show()
In [12]:
fig, ax = plt.subplots(figsize=(18,10))
ax.hist(featuresloan['repaid'], density=True, bins=50)
x = plt.xticks()[0]  
# get minimum and maximum value of x
xmin, xmax = min(x), max(x)  
# evenly spaced numbers over an interval
lnspc = np.linspace(xmin, xmax, len(featuresloan['repaid']))
# get mean and standard deviation
m, s = stats.norm.fit(featuresloan['repaid'])
# now get theoretical values in our interval
pdf_g = stats.norm.pdf(lnspc, m, s)
# plot the histogram and fit the normal distribution
plt.plot(lnspc, pdf_g, label="Norm") 
ax.set_xlabel("repaid", fontsize=18)
ax.set_ylabel("Frequency", fontsize=18)
ax.legend(loc=1, fontsize = 'x-large')
plt.show()

The histograms show clearly that the return feature fits the normal distribution much better than repaid. This is probably because repaid is a categorical feature with just two categories.
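The assignment also asks for a QQ plot of each target. A minimal sketch with scipy.stats.probplot is shown below on synthetic data (a stand-in for the 'return' column, since the real frame is not reloaded here); points hugging the reference line suggest approximate normality:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs anywhere
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.1, scale=0.25, size=1000)  # stand-in for 'return'

# probplot draws sample quantiles against theoretical normal quantiles
# and reports r, the correlation of the points with the reference line.
fig, ax = plt.subplots(figsize=(8, 6))
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm", plot=ax)
ax.set_title("QQ plot vs. fitted normal")
print(round(r, 3))
```

For the real data, passing featuresloan['return'] in place of the synthetic sample gives the plot the assignment requires.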

5. Plots demonstrating the relationship between each explanatory feature and each target feature

We used violin plots for the categorical features and scatter plots for the numerical features. Because 'total_pymnt_inv', 'funded_amnt_inv', and 'total_pymnt' are used to compute 'repaid' and 'return', we excluded them from consideration. We then chose the more representative plots, involving 'grade', 'home_ownership', 'pub_rec', 'term', 'last_pymnt_amnt', 'total_rec_int', 'out_prncp', 'out_prncp_inv', 'recoveries', and 'int_rate'.


1. For Categorical Features

From this part we can see:

  • For 'grade', we think it is strongly related to the target features. If the loan was repaid, the return increases as the grade worsens; if the borrower failed to repay, the return decreases as the grade worsens. In our view, this is because a worse grade means the borrower (an individual, company, etc.) must pay a higher financing cost to borrow the same amount of money, so when the loan is repaid the return is high, and otherwise it is low. We also see an interesting point: people with low grades appear more likely to borrow, since the covered area for the lower grades is larger; perhaps lower-grade borrowers are more often short of funding.

  • For 'term', we also see a strong relationship with the target features: the longer the term, the better the return. We think this is because a longer term carries more risk that the borrower will not repay, so the company charges a higher return to balance the risk. Interestingly, when the borrower fails to repay, the mean return lost on the longer term is less than on the shorter one; we guess this is because long-term interest is higher, so the borrower has paid back more money before defaulting even without repaying the full amount.

  • For 'home_ownership', we expected people who own a home to be more likely to repay, since owning a house or apartment suggests they can afford it. However, we cannot see an apparent relationship between the target features and the home-ownership categories, since the covered area (number of people) and the mean return show no difference across the types.

In [13]:
# Define plot layout
# adjust the order of grade for better view
axs = sns.violinplot(x="grade", y="return", hue="repaid", data=loan, order=["A", "B","C","D","E","F","G"])
sns.despine(left=True, offset=10)
plt.show()

x=['initial_list_status', 'home_ownership', 'pub_rec', 'term']
# use for loop to draw most categorical features' violinplot
for i in range(len(x)):
    f, ax = plt.subplots(figsize=(10, 8))
    ax=sns.violinplot(x[i],y="return", hue="repaid", data=loan)
    ax.set_xlabel(x[i], fontsize=15)
    ax.set_ylabel('Return', fontsize=15)
plt.show()

# draw separately at a larger size for a better view, because 'purpose' has many categories
f, ax = plt.subplots(figsize=(25, 10))
ax=sns.violinplot(x="purpose",y="return", hue="repaid", data=loan)
ax.set_xlabel('Purpose', fontsize=15)
ax.set_ylabel('Return', fontsize=15)
plt.show()

2. For Numerical Features

As these features are continuous variables, we think scatter plots are suitable for showing all the data.

  • For 'last_pymnt_amnt', in our view, it has much to do with the target features. Interestingly, when the borrower failed to repay, the last payment amount is very low; we think this is because that last payment is not principal but only some interest. So if we know the last payment amount, we may be able to tell whether the borrower will repay and how much return we will get from them; for example, a high amount nearly always means the loan was repaid.

  • For 'total_rec_int', we think it also matters for predicting return and repayment. The scatter plot shows that many borrowers who failed to repay the principal paid relatively less interest than those who repaid. We can also see that the more total interest received, the smaller the loss (return) from the borrowers.

  • For 'out_prncp', there is some relationship with the target features: the larger the remaining outstanding principal, the lower the return we will get from the borrowers. The remaining principal of borrowers who failed to repay is centered around 10,000, while for those who repaid it is centered at a very low amount. This makes sense because, once repaid, there is no remaining principal, unless the borrower paid a high interest rate so that the total payment exceeded the funded amount even before the principal was fully paid back.

  • For 'out_prncp_inv', it shows the same characteristics as 'out_prncp'.

  • For 'recoveries', we think this feature is related to the target features, because most people with a post-charge-off recovery amount never repaid, so we can use it for prediction. Interestingly, the larger the recovery amount, the lower the return from borrowers who paid back, but the smaller the loss from borrowers who failed to repay.

  • For 'int_rate', the relationship is strong. For borrowers who repaid, we naturally get more return when the interest rate is higher, so we can use it to calculate how much return we will get. But this plot does not tell us whether the money will be repaid.

In [16]:
# select samples from data for a better view of the scatterplot
aloan=loan.sample(8000)
y=['last_pymnt_amnt', 'total_pymnt', 'total_rec_int', 'out_prncp', 'out_prncp_inv', 'recoveries', 'funded_amnt', 'funded_amnt_inv', 'int_rate', 'total_rec_late_fee', 'total_acc', 'loan_amnt','dti']
# use for loop to draw all the numerical features' scatterplot
for i in range(len(y)):
    plt.figure(figsize=(20,10))
    plt.scatter(aloan[y[i]][aloan['repaid'] == True], aloan['return'][aloan['repaid'] == True],label='True')
    plt.scatter(aloan[y[i]][aloan['repaid'] == False], aloan['return'][aloan['repaid'] == False],label='False')
    plt.xlabel(y[i], fontsize=18)
    plt.ylabel('return', fontsize=16)
    plt.show()

Machine Learning


In [17]:
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest
import warnings
warnings.filterwarnings("ignore")
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib import cm
%matplotlib inline

sns.set(style="white")
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score,cross_val_predict
import random

1. Training and Testing Datasets for Regression and Classification

In [57]:
fea_sel_reg = featuresloan[['last_pymnt_amnt', 'total_rec_int', 'out_prncp', 'recoveries', 'out_prncp_inv', 'int_rate']].to_numpy()

return_data = featuresloan[['return']].to_numpy()

repaid_data = featuresloan['repaid']

## for regression

reg_train, reg_test, return_train, return_test= train_test_split(fea_sel_reg,return_data,test_size=0.4, random_state=23)

## for classification
X_train,X_test, y_train, y_test= train_test_split(fea_sel_reg,repaid_data,test_size=0.4, random_state=23)

2. Build a classifier on the training dataset to classify loans as either repaid or not.

LogisticRegression

In [37]:
#Classification method 1
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression()
log_reg = log_reg.fit(X_train, y_train)
In [38]:
from sklearn import metrics

prediction_test = log_reg.predict(X_test)

print(100*metrics.accuracy_score(y_test, prediction_test))
91.0484862757
In [39]:
#Confusion Matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

labels = ['Non-Repaid', 'Repaid']
# Create and display confusion matrix
print(confusion_matrix(y_test, prediction_test))
[[ 2431  3269]
 [  152 32365]]
In [40]:
# Probability
print(classification_report(y_test, prediction_test, \
                            target_names = labels))
             precision    recall  f1-score   support

 Non-Repaid       0.94      0.43      0.59      5700
     Repaid       0.91      1.00      0.95     32517

avg / total       0.91      0.91      0.90     38217

In [41]:
#Confusion matrix plot
import itertools
class_names = ['Non-Repaid', 'Repaid']
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Compute confusion matrix
conf_log = confusion_matrix(y_test, prediction_test)
np.set_printoptions(precision=2)

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(conf_log, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(conf_log, classes=class_names,
                      title='Confusion matrix, without normalization')

plt.show()
Normalized confusion matrix
[[ 0.43  0.57]
 [ 0.    1.  ]]
Confusion matrix, without normalization
[[ 2431  3269]
 [  152 32365]]

RandomForestClassifier

In [42]:
# Classification method 2
from sklearn.ensemble import RandomForestClassifier  

rfc = RandomForestClassifier()
# Fit estimator to scaled training data
rfc = rfc.fit(X_train, y_train)
In [43]:
predict_test = rfc.predict(X_test)
print(100*metrics.accuracy_score(y_test, predict_test))
95.4287359029
In [44]:
#Confusion Matrix
from sklearn.metrics import confusion_matrix

# Create and display confusion matrix
print(confusion_matrix(y_test, predict_test))
[[ 4753   947]
 [  800 31717]]
In [45]:
#Probability
print(classification_report(y_test, predict_test, \
                            target_names = labels))
             precision    recall  f1-score   support

 Non-Repaid       0.86      0.83      0.84      5700
     Repaid       0.97      0.98      0.97     32517

avg / total       0.95      0.95      0.95     38217

In [46]:
# Reuse the plot_confusion_matrix function defined above.
# Compute confusion matrix
conf_rfc = confusion_matrix(y_test, predict_test)
np.set_printoptions(precision=2)

# Plot normalized confusion matrix
plt.figure()
plot_confusion_matrix(conf_rfc, classes=class_names, normalize=True,
                      title='Normalized confusion matrix')

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(conf_rfc, classes=class_names,
                      title='Confusion matrix, without normalization')

plt.show()
Normalized confusion matrix
[[ 0.83  0.17]
 [ 0.02  0.98]]
Confusion matrix, without normalization
[[ 4753   947]
 [  800 31717]]

ROC Curve and AUC for Both Classification Methods

In [47]:
from sklearn.metrics import roc_curve, auc

# Predicted probability of the positive (Repaid) class
y_score_log = log_reg.predict_proba(X_test)[:, 1]
# ROC curve (false and true positive rates) and area under it
xclass, yclass, _ = roc_curve(y_test, y_score_log)
roc_auc_log = auc(xclass, yclass)
In [48]:
# RandomForestClassifier has no decision_function, so use predict_proba
y_score_rfc = rfc.predict_proba(X_test)[:, 1]

# ROC curve and ROC area

xclass2, yclass2, _ = roc_curve(y_test, y_score_rfc)
roc_auc_rfc = auc(xclass2, yclass2)
In [49]:
# Plot data and model
fig, ax = plt.subplots(figsize=(8, 8))
ax.plot(xclass, yclass, alpha = 0.75, linestyle='-',
        label=f'Log (AUC = {roc_auc_log:4.2f})')

ax.plot(xclass2, yclass2, alpha = 0.5, linestyle='-.',
        label=f'Rfc (AUC = {roc_auc_rfc:4.2f})')

ax.plot([0, 0, 1], [0, 1, 1], alpha = 0.5, 
        lw=1, linestyle='-.', label='Perfect')

# Decorate plot appropriately
ax.set_title('Receiver Operating Characteristic Curve', fontsize=18)
ax.set_xlabel('False Positive Rate', fontsize=16)
ax.set_ylabel('True Positive Rate', fontsize=16)
ax.set_xlim(-0.05, 1.05)
ax.set_ylim(-0.05, 1.05)
ax.set_aspect('equal')
ax.legend(loc=4, fontsize=16)
sns.despine(offset=5, trim=True)
plt.show()

Final Visual: Gain Chart

In [50]:
# Function computes the gain from a given scikit-learn estimator and the test data features and labels

def compute_gain(mdl, d_test, l_test):

    if hasattr(mdl, 'decision_function'):
        # Use the decision-function score for the 'positive' label
        pos_score = mdl.decision_function(d_test)
    else:
        # Otherwise use the predicted probability of the 'positive' label
        pos_score = mdl.predict_proba(d_test)[:, 1]

    # Generate one-hot class membership matrix from the test labels
    clm = pd.get_dummies(l_test).values
    
    # Second column indicates 'positive' label
    pos_lbl = clm[:,1]
    
    # Compute total number of 'positive' labels
    n_pos_lbl = np.sum(pos_lbl)
    
    # Generate sorted (ascending) index array by 'positive' score
    idx = np.argsort(pos_score) 
    
    # Now sort 'positive' labels by the sorted index
    sort_pos = pos_lbl[idx[::-1]]
    
    # Compute cumulative sum
    cum_sum_pos = np.cumsum(sort_pos)
    
    # lift is the ratio of cumulative improvement to all data 
    lift = cum_sum_pos/n_pos_lbl
    
    # Number of test instances
    num_test = l_test.shape[0]
    
    # Amount of test data included as a percentage of total
    tst_pcnt = np.arange(1,num_test + 1, 1) /num_test
    
    return tst_pcnt, lift
In [51]:
#This is for the Gain chart

# Compute gains for different estimators

x_log, gain_log = compute_gain(log_reg, X_test, y_test)
x_rfc, gain_rfc = compute_gain(rfc, X_test, y_test)

# Plot
fig, ax = plt.subplots(figsize=(10, 10))


ax.plot(x_log, x_log, alpha=0.5, linestyle='-', label='Baseline')

ax.plot(x_log, gain_log, alpha=0.75, linestyle='-.', label='Log Gain')

ax.plot(x_rfc, gain_rfc, alpha=0.75, linestyle=':', label='Rfc Gain')


# Decorate plot appropriately
ax.set_title('Gain Chart', fontsize=18)
ax.set_xlabel('Test Data', fontsize=16)
ax.set_ylabel('Gain', fontsize=16)
ax.set_xlim(-0.05, 1.05)
ax.set_ylim(-0.05, 1.05)
ax.set_aspect('equal')
ax.legend(loc=4, fontsize=16)
sns.despine(offset=5, trim=True)
plt.show()

Beyond the raw performance metrics, the model we prefer is the Random Forest Classifier. Logistic Regression is one of the most common methods because of its similarity to linear regression, its efficiency, and its tendency to avoid overfitting, but the Random Forest Classifier is the better model overall. Random Forest also resists overfitting, and it builds individual decision trees on random subsets of features, which makes the ensemble less sensitive to minor fluctuations in the data. This property gives the Random Forest Classifier its edge when making predictions.

After reviewing the results, both models classified repayment well: the Random Forest Classifier achieved 95.4% accuracy and the Logistic Regression model 91.0% on the test set of 38,217 loans. Of those loans, 32,517 were repaid and 5,700 were not, an overall repayment rate of about 85%. The features taken into account were six continuous variables and three discrete variables: 'last_pymnt_amnt', 'total_rec_int', 'out_prncp', 'out_prncp_inv', 'recoveries', 'int_rate', 'grade', 'home_ownership', 'term'. The numerical features were selected based on the highest scores from the SelectKBest feature-selection method, and the categorical features were chosen based on the strong relationships between loan return and repayment observed in the Exploratory Data Analysis. Finally, both classifiers clearly outperform choosing loans at random: their metrics show that the models are far more accurate, and provide a stronger sense of security, than simply guessing on the test data.
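The SelectKBest step described above can be sketched as follows. This is an illustration on synthetic data, not the notebook's actual selection run: the array shapes and the informative column are made up, and `f_classif` (scikit-learn's default univariate test for classification) is assumed as the scoring function.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for the candidate numeric loan columns and 'repaid' labels
rng = np.random.RandomState(23)
X = rng.rand(300, 8)                          # 8 candidate features
y = (X[:, 2] + 0.2 * rng.randn(300)) > 0.5    # only column 2 is informative

# Keep the k columns with the highest univariate F-scores
selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)
selected = np.flatnonzero(selector.get_support())
print('selected columns:', selected)          # column 2 should be among them
```

In the notebook, the same pattern would be applied to the full numeric feature matrix with the 'repaid' column as the target.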

Three visuals were used: the confusion matrix, the ROC curve with AUC, and the gain chart. The confusion matrix, which was used to evaluate the probability of repayment, shows the counts of correctly and incorrectly classified repaid and non-repaid loans, in both raw and normalized form. The ROC curve compares the effectiveness of the two classification models against one another, and the area under the curve (AUC) measures the quality of each classifier: a perfect model has an AUC of 1, while most models fall between 0.5 and 1. The gain chart shows how much of the positive class each model captures as an increasing, score-ranked fraction of the test data is included, compared against a random baseline.

Based on these features, we are confident the Lending Club™ can take advantage of these classification models to plan wisely and make better business decisions. Using the models, the Lending Club™ can estimate how likely a borrower is to repay a loan.
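Per-loan repayment probabilities can be read directly off a fitted classifier with `predict_proba`. A minimal sketch on synthetic data (the features and labels here are made up; in the notebook one would pass `X_test` to the fitted `rfc`):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the notebook's loan features and repayment labels
rng = np.random.RandomState(23)
X = rng.rand(200, 6)
y = X[:, 0] + 0.1 * rng.randn(200) > 0.5   # True = "repaid"

clf = RandomForestClassifier(random_state=23)
clf.fit(X[:150], y[:150])

# predict_proba returns one column per class (ordered as clf.classes_);
# the True column is each loan's estimated repayment probability.
pos_col = list(clf.classes_).index(True)
proba_repaid = clf.predict_proba(X[150:])[:, pos_col]
print(proba_repaid[:5])
```

Ranking loans by this probability is one direct way to turn the classifier into a lending decision aid.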

3. Build a regression model on the training dataset to predict the loan return.

Regression Analysis

We use the following regression models (with their corresponding R-squared scores on the test set) to predict the return on the loan based on six numerical features: last_pymnt_amnt, total_rec_int, out_prncp, recoveries, out_prncp_inv, int_rate.

  • Linear Regression : 18.8%
  • Random Forest Regression : 69.4%
  • Random Forest Regression with Cross Validation : 68.7%
  • Extremely Randomized Trees Regression : 68.7%
  • Decision Tree Regression : 46.5%
  • Gradient Tree Boosting Regression : 62.4%
  • AdaBoost Regression : 9.0%
  • Pipeline Regression : 69.4%

Predictive power consists of two primary components: goodness of fit and predictive accuracy. Goodness of fit reflects training error, indicating how well a model predicts the data points already used to estimate its parameters. Predictive accuracy reflects testing error, measuring how well a model predicts new data points whose true values it has not seen. Since R² can quantify both goodness of fit and predictive accuracy, we expect that, in general, a higher R² score yields more precise predictions. On this basis, we prefer the following models:

  • Random Forest Regression (RFR with Cross Validation)
  • Extremely Randomized Trees Regression
  • Pipeline Regression
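The distinction between goodness of fit and predictive accuracy can be made concrete by scoring the same model on both splits. A sketch on synthetic data (in the notebook the splits would be `reg_train`/`reg_test`; the shapes and data here are illustrative only):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the six loan features and the return target
rng = np.random.RandomState(23)
X = rng.rand(500, 6)
y = X @ rng.rand(6) + 0.1 * rng.randn(500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=23)
model = RandomForestRegressor(random_state=23).fit(X_tr, y_tr)

# Goodness of fit: R^2 on the data used to estimate the parameters
r2_train = model.score(X_tr, y_tr)
# Predictive accuracy: R^2 on held-out data the model has not seen
r2_test = model.score(X_te, y_te)
print(f'train R^2 = {r2_train:.3f}, test R^2 = {r2_test:.3f}')
```

The gap between the two numbers is the overfitting discussed above; a model can score highly in-sample while generalizing poorly.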

But R-squared has several key limitations. R-squared cannot tell us whether the predictions and coefficient estimates are biased, nor does it by itself indicate whether a regression model is adequate. Therefore, we further analyze the residual plots. Recall that a residual is the unpredictable random part of each data point; we expect the residuals to be randomly scattered in the plot without any systematic pattern. Ideally, they are

  1. clustered toward the middle of the plot and symmetrically distributed,
  2. centered around the zero line on the y-axis, and
  3. free of clear patterns overall.

We applied these three criteria to evaluate the residual plots of the regression models above. The residual plots for the Linear Regression model do not meet these requirements: they contain outliers and are not evenly distributed vertically. While AdaBoost Regression has a low R² score, its residuals resemble those of Random Forest Regression, Extremely Randomized Trees Regression, and Pipeline Regression in being scattered relatively randomly across the plots, which indicates a good fit.
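As a numeric complement to eyeballing the residual plots, the three criteria above can be checked with summary statistics. A sketch (the well-behaved residuals here are simulated; in the notebook one would pass `return_test.reshape(-1) - pred`):

```python
import numpy as np

def residual_diagnostics(residuals):
    """Summarize residuals against the three criteria: centered on zero,
    roughly symmetric, and without extreme outliers."""
    residuals = np.asarray(residuals, dtype=float)
    return {
        'mean': residuals.mean(),          # should be near 0
        'median': np.median(residuals),    # symmetry: median near mean
        'std': residuals.std(),
        'share_within_2std': float(np.mean(
            np.abs(residuals - residuals.mean()) < 2 * residuals.std())),
    }

# Well-behaved residuals cluster tightly and symmetrically around zero
rng = np.random.RandomState(23)
stats = residual_diagnostics(rng.normal(0.0, 1.0, 5000))
print(stats)
```

A mean far from zero, a median far from the mean, or a small share within two standard deviations would all flag a systematic pattern worth investigating in the plots.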

Based on the analysis above, we prefer to use the Extremely Randomized Trees and AdaBoost Regression models to predict the return.


What is the return on the loan?

The Extremely Randomized Trees Regression model generated the following predicted return statistics:

  • Predicted maximum return = 64.0%
  • Predicted minimum return = -100.0%
  • Predicted average return = 8.4%
  • Predicted std of return = 22.6%

The AdaBoost Regression model generated a narrower range of predicted returns with a lower standard deviation:

  • Predicted maximum return = 27.1%
  • Predicted minimum return = -65.4%
  • Predicted average return = -3.4%
  • Predicted std of return = 17.2%

Based on the statistics above, we believe the return on a loan falls in the range of -100% to 64.0%, and the average return falls in the range of -3.4% to 8.4%.

In [64]:
def make_res_plot(ind_test, dep_test, features, results, m):

    fig, ax = plt.subplots()
    # Residual = observed return minus predicted return
    ax.scatter(ind_test, dep_test - results, label='Testing Data')
    ax.hlines(0, 0, m, color='r', alpha=0.25)
    ax.set_xlabel(features, fontsize=14)
    ax.set_ylabel("Residual", fontsize=14)
    ax.set_title("Regression Plot (model residuals)", fontsize=14)

    return ax
In [65]:
def make_Reg_plt(reg_test, return_test, pred):
    n = 4000
    # Sample the SAME rows for features, targets, and predictions, so each
    # residual stays paired with its own feature value
    idx = np.random.choice(reg_test.shape[0], n, replace=False)
    return_test_s = return_test.reshape(-1)[idx]
    pred_s = pred[idx]
    make_res_plot(reg_test[idx, 0], return_test_s, 'last_pymnt_amnt', pred_s, 35000)
    make_res_plot(reg_test[idx, 1], return_test_s, 'total_rec_int', pred_s, 25000)
    make_res_plot(reg_test[idx, 2], return_test_s, 'out_prncp', pred_s, 20000)
    make_res_plot(reg_test[idx, 3], return_test_s, 'recoveries', pred_s, 20000)
    make_res_plot(reg_test[idx, 4], return_test_s, 'out_prncp_inv', pred_s, 20000)
    make_res_plot(reg_test[idx, 5], return_test_s, 'int_rate', pred_s, 50)

    return
In [66]:
def reg_statistics(result):
    # Summarize the distribution of predicted returns
    print('Predicted maximum return= {:.1%}'.format(np.max(result)))
    print('Predicted minimum return = {:.1%}'.format(np.min(result)))
    print('Predicted average return = {:.1%}'.format(np.mean(result)))
    print('Predicted std of return = {:.1%}'.format(np.std(result)))
    return

Linear Regression

In [67]:
model = LinearRegression(fit_intercept = True)
model.fit(reg_train,return_train)
result = model.predict(reg_test)
score = 100.0 * model.score(reg_test, return_test)

print(f'Multivariate LR Model score = {score:5.2f}%')
Multivariate LR Model score = 18.79%
In [68]:
make_Reg_plt(reg_test,return_test,result.reshape(-1))

Predicted Return Statistics

In [69]:
reg_statistics(result)
Predicted maximum return= 80.8%
Predicted minimum return = -359.2%
Predicted average return = 8.3%
Predicted std of return = 11.3%

Random Forest Regression and RFR With Cross Validation

In [70]:
## Random Forest: Regression 
regressor = RandomForestRegressor(random_state=23)
auto_model = regressor.fit(reg_train, return_train)
print('Score = {:.1%}'.format(auto_model.score(reg_test, return_test)))
pred = auto_model.predict(reg_test)
mr2 = r2_score(return_test, pred)
print(f'R^2 Score = {mr2:5.3f}')

## Random Forest Regression with Cross Validation
kf = KFold(n_splits=6, random_state=23)
scores = cross_val_score(regressor, reg_train, return_train, cv=kf)
mean_score=np.mean(scores)

print('CV Score = {:.1%}'.format(mean_score))
Score = 69.4%
R^2 Score = 0.694
CV Score = 68.7%
In [73]:
make_Reg_plt(reg_test,return_test,pred)

Predicted Return Statistics

In [74]:
reg_statistics(pred)
Predicted maximum return= 66.8%
Predicted minimum return = -100.0%
Predicted average return = 8.5%
Predicted std of return = 22.8%

Extremely Randomized Trees: Regression

In [75]:
## Extremely Randomized Trees: Regression
auto_model = ExtraTreesRegressor(random_state=23)
auto_model = auto_model.fit(reg_train, return_train)
print('Score = {:.1%}'.format(auto_model.score(reg_test, return_test)))

pred = auto_model.predict(reg_test)
mr2 = r2_score(return_test, pred)
print(f'R^2 Score = {mr2:5.3f}')
Score = 68.7%
R^2 Score = 0.687
In [76]:
make_Reg_plt(reg_test,return_test,pred)

Predicted Return Statistics

In [77]:
reg_statistics(pred)
Predicted maximum return= 64.0%
Predicted minimum return = -100.0%
Predicted average return = 8.4%
Predicted std of return = 22.6%

Decision Tree Regression

In [78]:
#Decision Tree Regression
auto_model = DecisionTreeRegressor(random_state=23)
auto_model = auto_model.fit(reg_train, return_train)
print('Score = {:.1%}'.format(auto_model.score(reg_test, return_test)))

pred = auto_model.predict(reg_test)
mr2 = r2_score(return_test, pred)
print(f'R^2 Score = {mr2:5.3f}')
Score = 46.5%
R^2 Score = 0.465

Gradient Tree Boosting: Regression

In [79]:
## Gradient Tree Boosting: Regression
auto_model = GradientBoostingRegressor(random_state=23)
auto_model = auto_model.fit(reg_train, return_train)
print('Score = {:.1%}'.format(auto_model.score(reg_test, return_test)))

pred = auto_model.predict(reg_test)
mr2 = r2_score(return_test, pred)
print(f'R^2 Score = {mr2:5.3f}')
Score = 62.4%
R^2 Score = 0.624

AdaBoost Regression

In [80]:
##AdaBoost Regression
auto_model = AdaBoostRegressor(random_state=23)
auto_model = auto_model.fit(reg_train, return_train)
print('Score = {:.1%}'.format(auto_model.score(reg_test, return_test)))

pred = auto_model.predict(reg_test)
mr2 = r2_score(return_test, pred)
print(f'R^2 Score = {mr2:5.3f}')
Score = 9.0%
R^2 Score = 0.090
In [81]:
make_Reg_plt(reg_test,return_test,pred)
In [82]:
reg_statistics(pred)
Predicted maximum return= 27.1%
Predicted minimum return = -65.4%
Predicted average return = -3.4%
Predicted std of return = 17.2%

Pipeline Regression

In [83]:
## Pipeline Regression
auto_model_p = RandomForestRegressor(random_state=23)
am_reg = Pipeline([('RFR', auto_model_p)])
am_reg.set_params(RFR__random_state=23)
am_reg.fit(reg_train, return_train)
print('Score = {:.1%}'.format(am_reg.score(reg_test, return_test)))

pred = auto_model_p.predict(reg_test)
mr2 = r2_score(return_test, pred)
print(f'R^2 Score = {mr2:5.3f}')
Score = 69.4%
R^2 Score = 0.694
In [84]:
make_Reg_plt(reg_test,return_test,pred)

Predicted Return Statistics

In [85]:
reg_statistics(pred)
Predicted maximum return= 66.8%
Predicted minimum return = -100.0%
Predicted average return = 8.5%
Predicted std of return = 22.8%

Lasso Regression

In [86]:
## Lasso regression

def Lasso_r(X_train, y_train, X_test, y_test, alpha, random_state):
    # Fit a Lasso model at the given regularization strength and
    # return its R^2 score on the test data
    lasso = Lasso(alpha=alpha, random_state=random_state)
    reg = lasso.fit(X_train, y_train)
    return reg.score(X_test, y_test)

print('Score = {:.1%}'.format(Lasso_r(reg_train, return_train, reg_test, return_test, 1, 23)))

## lasso Graph
alphas = np.arange(1, 250, 1)
scores = []
for alpha in alphas:
    scores.append(Lasso_r(reg_train, return_train, reg_test, return_test, alpha, 23))

plt.plot(alphas, scores, label="r2 score vs alpha")
plt.title("Lasso Regression", fontsize=18)
plt.legend(loc='best')
Score = 17.8%
Out[86]:
<matplotlib.legend.Legend at 0x7fc7b22d9748>